details of their post-translational modifications. 2-D gel databases are beginning to be
linked to or integrated with comprehensive protein and nucleic acid databases
(Neidhardt et al., 1989; Simpson et al., 1992; Appel et al., 1994), and 'organism'
databases, containing DNA sequence data, chromosomal map locations, reference
2-D gels and protein functional information for an organism, are becoming established
as genome and proteome projects progress (VanBogelen et al., 1992; Yeast Protein
Database cited in Garrels et al., 1994).



GEL IMAGE ANALYSIS AND REFERENCE GELS 

After 2-D electrophoresis and protein visualisation by staining, fluorography or
phosphorimaging, images of gels are digitised for computer analysis by an image
scanner, laser densitometer, or charge-coupled device (CCD) camera (Garrels, 1989;
Celis et al., 1990a; Urwin and Jackson, 1993). All systems digitise gels with a
resolution of 100-200 µm, and can detect a wide range of densities or shading (256
or more grey scales). Following this, gel images are subjected to a series of
manipulations to remove vertical and horizontal streaking and background haze, to detect
spot positions and boundaries, and to calculate spot intensity (Figure 3). A standard
spot (SSP) number, containing vertical and horizontal positional information, is
assigned to each detected spot and becomes the protein's reference number. Table 2
lists some notable software packages which process 2-D gel images.



Table 2: Some Software Packages for the Analysis of Gel Images.

Gel Image Analysis System      References*

ELSIE 4 & 5                    Olsen and Miller, 1988; Wirth et al., 1991; Wirth et al., 1993
GELLAB I & II                  Wu, Lemkin and Upton, 1993; Lemkin, Wu and Upton, 1993;
                               Myrick et al., 1993
MELANIE I & II                 Appel et al., 1991; Hochstrasser et al., 1991b
QUEST I & II and PDQUEST       Garrels, 1989; Monardo et al., 1994; Holt et al., 1992;
                               Celis et al., 1990a,b
TYCHO & KEPLER                 Anderson et al., 1984; Richardson, Horn and Anderson, 1994

* These references are not exhaustive; they include some references of use as well as
author of the system.



As there are difficulties in the electrophoresis of samples with 100% reproducibility,
reference gel images are often constructed from many gels of the same sample
(Garrels and Franza, 1989; Neidhardt et al., 1989). Since this involves the matching of
2000 to 4000 proteins from one gel to another, it presents a considerable challenge to
image analysis systems. Matching of gels is usually initiated by an operator, who
manually designates approximately 50 prominent spots as 'landmarks' on gels
to be cross-matched. Proteins which match are then established around landmarks,
using computer-based vector algorithms to extend the matching over the entire gel.
Close to 100% of spots from complex samples can be matched by these methods,
although different degrees of operator intervention may be required (Olsen and Miller,
1988; Lemkin and Lester, 1989; Garrels, 1989; Myrick et al., 1993).
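
The landmark-based matching described above can be illustrated with a very small
sketch. It is not the algorithm of any of the packages in Table 2: it simply assumes
that the displacement vector of the nearest operator-chosen landmark predicts where a
spot from gel A should lie on gel B, and accepts the closest gel B spot within a
tolerance. The function name, the tolerance value and the coordinate units are all
illustrative assumptions.

```python
import math


def match_spots(spots_a, spots_b, landmarks, tolerance=2.0):
    """Match spots of gel A to gel B using landmark displacement vectors.

    spots_a, spots_b: lists of (x, y) spot centres.
    landmarks: list of ((xa, ya), (xb, yb)) pairs designated by the operator.
    Returns a dict mapping an index in spots_a to an index in spots_b.
    """
    matches = {}
    for i, (x, y) in enumerate(spots_a):
        # Use the displacement of the landmark closest to this spot on gel A.
        (lx, ly), (mx, my) = min(
            landmarks, key=lambda lm: math.dist(lm[0], (x, y))
        )
        predicted = (x + (mx - lx), y + (my - ly))
        # Take the gel B spot nearest to the predicted position.
        j = min(range(len(spots_b)),
                key=lambda k: math.dist(spots_b[k], predicted))
        if math.dist(spots_b[j], predicted) <= tolerance:
            matches[i] = j
    return matches
```

This sketch does not enforce one-to-one matching and leaves spots outside the
tolerance unmatched, which is where operator intervention would be needed.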
Figure 3. Computer processing of gel images. Shown is a wide pI range 2-D separation of human liver
proteins, processed by Melanie software (Appel et al., 1991). (A) Original gel image as captured by laser
densitometer. (B) Gel image after processing to remove streaking and background. (C) Outline definition
of all spots on the gel.
Progress with proteome projects

CALCULATION OF PROTEIN ISOELECTRIC POINT AND MOLECULAR WEIGHT

Estimation of the isoelectric point (pI) and molecular weight (MW) of proteins from
2-D gels provides fundamental parameters for each protein, which are also of use
during identification procedures (see following section). The pI and MW of proteins
are recorded in 2-D gel databases. Accurate estimations of protein pI and MW can be
obtained by using 20 or more known proteins on a reference map to construct standard
curves of pI and molecular weight, which are then used to calculate estimated pI and
MW of unknown proteins (Neidhardt et al., 1989; Garrels and Franza, 1989;
VanBogelen, Hutton and Neidhardt, 1990; Anderson and Anderson, 1991; Anderson et
al., 1991; Latham et al., 1992). Alternatively, the MW of individual proteins blotted
to PVDF can be determined very accurately by direct mass spectrometry (Eckerskorn
et al., 1992). Where immobilised pH gradients are used, the focusing position of
proteins allows their pI to be measured within 0.15 units of that calculated from the
amino acid sequence (Bjellqvist et al., 1993c). It must be noted, however, that proteins
carrying post-translational modifications may migrate to unexpected pI or MW
positions during electrophoresis (Packer et al., 1995).
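
A minimal sketch of the standard-curve approach follows. It assumes, for illustration
only, that pI varies linearly with horizontal position (as on a linear immobilised pH
gradient) and that log10(MW) varies linearly with vertical migration distance; real
gels may need non-linear curves. The function names and the tuple layout of the
standards are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def make_estimator(standards):
    """Build an estimator from known proteins on the reference map.

    standards: list of (x_position, y_position, pI, MW) tuples.
    Returns a function mapping a spot's (x, y) to (estimated pI, estimated MW).
    """
    pi_slope, pi_icpt = fit_line([s[0] for s in standards],
                                 [s[2] for s in standards])
    # MW is fitted on a log scale against migration distance (an assumption).
    mw_slope, mw_icpt = fit_line([s[1] for s in standards],
                                 [math.log10(s[3]) for s in standards])

    def estimate(x, y):
        return pi_slope * x + pi_icpt, 10 ** (mw_slope * y + mw_icpt)

    return estimate
```

In practice the 20 or more known proteins mentioned above would be supplied as the
standards, and spots lying outside the range they cover should be treated with caution.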

SPOT QUANTITATION AND EXPRESSION ANALYSIS 

A major challenge faced in proteome projects is the quantitative analysis of proteins
separated by 2-D electrophoresis. The most accurate means of protein quantitation is
to determine chemically the amount of each protein present by amino acid
compositional analysis. However, the current method of choice for quantitative analysis
of many proteins is to radiolabel samples with [35S]methionine or 14C amino acids,
perform the 2-D electrophoresis, and measure protein levels in disintegrations per
minute (dpm) or units of optical density. Quantitation is achieved either by liquid
scintillation counting, or by gel image analysis where spot densities are quantitated
by reference to gel calibration strips containing known amounts of radiolabelled
protein or against the integrated optical density of all spots visualised (Vandekerkhove
et al., 1990; Celis et al., 1990b; Celis and Olsen, 1994; Garrels, 1989; Latham,
Garrels and Solter, 1993; Fey et al., 1994). All approaches effectively allow spots to
be normalised against the total disintegrations per minute loaded onto the gel.
Limitations that remain with radiolabelling methods are that absolute quantitation is
not achieved because all proteins have varying amounts of any amino acid, and that
only easily labelled samples can be investigated. Quantitative silver staining presents
an alternative (Giometti et al., 1991; Harrington et al., 1992; Rodriguez et al., 1993;
Myrick et al., 1993), which when undertaken with [35S]thiourea (Wallace and Saluz,
1992a,b) is of extremely high sensitivity.
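
Normalisation against the total signal on a gel can be sketched as below. The example
expresses each spot as parts per million of the summed spot signal; the choice of ppm,
the function name and the use of SSP numbers as dictionary keys are illustrative
assumptions rather than the convention of any particular package.

```python
def normalise_spots(spot_signal):
    """Express each spot as parts per million of the total signal on the gel.

    spot_signal: dict mapping an SSP number to its integrated optical density
    (or dpm). Returns a dict with the same keys and ppm values.
    """
    total = sum(spot_signal.values())
    if total <= 0:
        raise ValueError("total spot signal must be positive")
    return {ssp: 1e6 * value / total for ssp, value in spot_signal.items()}
```

Values normalised in this way can be compared between gels loaded with different
amounts of sample, but they remain relative rather than absolute quantities.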

When protein spots from samples prepared under different conditions are quantitated
and matched from gel to gel, it becomes possible to examine changes and patterns in
protein expression. Large scale investigation of up- and down-regulation of proteins,
their appearance and disappearance, can be undertaken. For example, simian virus 40
transformed human keratinocytes were shown to have 177 up-regulated and 58
down-regulated proteins compared to normal keratinocytes (Celis and Olsen, 1994); detailed
synthesis profiles of 1200 proteins have been established in 1 to 4 cell mouse embryos
(Latham et al., 1991, 1992); and 4 proteins out of 1971 were found to be markers for
cadmium toxicity in urinary proteins (Myrick et al., 1993). Complex global changes
in protein expression as a result of gene disruptions have also been investigated (S. Fey
and P. Mose-Larsen, personal communication). Impressively, large gel sets showing
protein expression under different conditions can be globally investigated using
statistical methods that find groups of related objects within a set. For example, the
REF52 rat cell line database, consisting of 79 gels from 12 experimental groups where
each gel contains quantitative data for 1600 cross-matched proteins, has been analysed
by cluster analysis (Garrels et al., 1990). This revealed clusters of proteins that, for
example, were induced or repressed similarly under simian virus 40 or adenovirus
transformation, suggesting a common mechanism. Protein groups that were induced
or repressed during culture growth to confluence were also found. It is obvious that the
potential for investigation of cellular control mechanisms by these approaches is
immense. It is equally clear that investigations of gene expression of this scale are
currently technically impossible using nucleic-acid based techniques.
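
To make the idea of finding groups of co-regulated proteins concrete, the sketch below
groups expression profiles whose Pearson correlation exceeds a threshold, using
single-linkage merging. This is only one simple form of cluster analysis and is not
claimed to be the method applied to the REF52 database; the threshold and all names
are assumptions.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat profile carries no pattern to correlate
    return cov / math.sqrt(var_a * var_b)


def cluster_profiles(profiles, threshold=0.9):
    """Group proteins whose profiles correlate at or above the threshold.

    profiles: dict mapping a protein (e.g. SSP number) to a list of
    normalised spot quantities, one per experimental condition.
    Returns a sorted list of groups, each a sorted list of protein names.
    """
    names = sorted(profiles)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the representative of the group.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(profiles[a], profiles[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())
```

Proteins that rise and fall together across conditions end up in the same group, which
is the kind of pattern that suggests a common regulatory mechanism.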

Table 3: Some proteome databases and their special features

E. coli gene-protein database
    Special features: Gel spots linked with GenBank and Kohara clones; quantitative
        spot measurements under different growth conditions
    References: VanBogelen and Neidhardt, 1991; VanBogelen et al., 1992

Human heart databases
    Special features: Identification of disease markers; two separate databases have
        been established
    References: Baker et al., 1992; Corbett et al., 1994b; Jungblut et al., 1994

Human keratinocyte database
    Special features: Extensive identifications; quantitative spot measurements of
        transformed cells; identification of disease markers
    References: Celis et al., 1990a; Celis et al., 1993; Celis and Olsen, 1994

Mouse embryo database
    Special features: Quantitative spot measurements through 1 to 4 cell stage
    References: Latham et al., 1991; Latham et al., 1992

Mouse liver database (Argonne Protein Mapping Group)
    Special features: Documents changes due to exposure to ionizing radiation and
        toxic chemicals
    References: Giometti, Taylor and Tollaksen, 1992

Rat liver epithelial database
    Special features: Detailed subcellular fractionation studies
    References: Wirth et al., 1991; Wirth et al., 1993

Rat liver database
    Special features: Extensive studies on regulation of proteins by drugs and toxic
        agents
    References: Anderson and Anderson, 1991; Anderson et al., 1992; Richardson, Horn
        and Anderson, 1994

REF52 rat cell line database
    Special features: Accessible via World Wide Web; quantitative spot measurements
        under different conditions
    References: Garrels and Franza, 1989; Boutell et al., 1994

SWISS-2DPAGE containing human reference maps
    Special features: Accessible via World Wide Web; completely integrated with
        SWISS-PROT and SWISS-3DIMAGE
    References: Appel et al., 1993; Hochstrasser et al., 1992; Hughes et al., 1993;
        Golaz et al., 1993

Yeast Protein Database (YPD) and Yeast Electrophoretic Protein Database (YEPD)
    Special features: Completely cross-referenced organism database; YPD has
        extensive information on over 3500 proteins; YEPD has many identifications
    References: Garrels et al., 1994

FEATURES OF PROTEOME DATABASES



Proteome projects rely heavily on computer databases to store information about all
proteins expressed by an organism. 'Proteome databases' should contain detailed
information on proteins already characterised elsewhere, as well as protein data from
2-D gels such as apparent pI and MW, expression level under different conditions,
subcellular localisation, and information on post-translational modifications. Images
of reference 2-D gels, showing protein SSP numbers and protein identifications,
should also be included. Ideally, proteome databases should be accessible with
Macintosh or IBM personal computers and easy to use. Some proteome databases and
the areas they cover are listed in Table 3. Databases range from collections of
annotated gels to large databases of images integrated with protein and nucleic acid
sequence banks.

One example of an integrated proteome database is the suite of SWISS-PROT,
SWISS-2DPAGE and SWISS-3DIMAGE databases (Appel et al., 1993; Appel et al.,
1994; Appel, Bairoch and Hochstrasser, 1994; Bairoch and Boeckmann, 1994). The
features of these three databases are listed in Table 4. SWISS-PROT,
SWISS-2DPAGE and SWISS-3DIMAGE are accessible through the World Wide Web



Table 4: The SWISS-PROT, SWISS-2DPAGE and SWISS-3DIMAGE suite of crosslinked databases.
All three databases are accessible through the World Wide Web, at URL address:
http://expasy.hcuge.ch/

SWISS-PROT
    Information: Text entries of sequence data; citation information; taxonomic data;
        38,303 entries in Release 29
    Annotations: Protein function; post-translational modifications; domains;
        secondary structure; quaternary structure; diseases associated with protein;
        sequence conflicts
    Cross-referenced databases: SWISS-2DPAGE; SWISS-3DIMAGE; EMBL; PIR; PDB; OMIM;
        PROSITE; Medline; FlyBase; GCRDb; MaizeDB; WormPep; DictyDB
    Other features: Navigation to other SWISS databases achieved by selecting entries
        with computer mouse

SWISS-2DPAGE
    Information: 2-D gel images of: human liver, plasma, HepG2, HepG2 secreted
        proteins, red blood cell, lymphoma, cerebrospinal fluid, macrophage-like cell
        line, erythroleukemia cell, platelet
    Annotations: Gel images where protein is found; how protein identified; protein
        pI and MW; protein number; normal and pathological variants
    Cross-referenced databases: SWISS-PROT and all other databases accessible through
        SWISS-PROT *
    Other features: Gel images show position of identified proteins, or region of gel
        where protein should appear

SWISS-3DIMAGE
    Information: Collection of 330 3-D images of proteins
    Annotations: All annotation is available in SWISS-PROT
    Cross-referenced databases: SWISS-PROT and all other databases accessible through
        SWISS-PROT *
    Other features: Mono and stereo images available; images can be transferred to
        local computer image viewing programs
(Berners-Lee et al., 1992), allowing any computer connected to the internet to access
the stored information and images. Navigation within and between the three databases
is seamless, as all potential crosslinks are highlighted as hypertext on the display and
can be selected with a computer mouse. From these databases, detailed information
about a protein, including amino acid sequence and known post-translational
modifications, can be obtained, the precise protein spot it corresponds to on a reference gel
image can be viewed if known, and the 3-D structure of the molecule can be seen if
available. References to nucleic acid and other databases are also given to provide
access to information stored elsewhere.

'Organism' databases, containing detailed protein and nucleic acid information
about a species, are becoming common as genome and proteome projects progress.
These differ from nucleic acid or protein sequence databases like GenBank or
SWISS-PROT because they are image based, and contain information about chromosomal
map positions, transcription of genes, and protein expression patterns. The
Escherichia coli gene-protein database (VanBogelen, Hutton and Neidhardt, 1990;
VanBogelen and Neidhardt, 1991; VanBogelen et al., 1992), known as the
ECO2DBASE, is one example. It contains gene and protein names, 2-D gel spot
information (including pI and MW estimates, and spot identification), genetic
information (GenBank or EMBL codes, chromosomal location, location on Kohara clones
(Kohara, Akiyama and Isono, 1987), transcription direction of genes), and protein
regulatory information (level of protein expression under different growth regimes,
member of regulon or stimulon). All entries in the ECO2DBASE are also
cross-referenced to the SWISS-PROT database (Bairoch and Boeckmann, 1994). It is
anticipated that organism databases will soon become a standard means of storing all
available information about a particular species. However, there is currently no
consistent manner in which organism databases are assembled, which may hamper
comparisons in the future.

Identification and characterisation of proteins from 2-D gels 

The number of proteins identified on a 2-D reference map determines its usefulness as
a research and reference tool. As most reference maps have only a small proportion of
proteins identified, a major aim of current proteome projects is to screen many proteins
from 2-D maps, in order to define them as 'known' in current nucleic acid and protein
databases, or as 'unknown'. Protein identification assists in confirmation of DNA
open reading frames, and provides focus for DNA sequencing projects and protein
characterisation efforts by pointing to proteins that are novel. Since there may be
3000-4000 proteins from a single 2-D map that require identification, the challenge in
protein screening is to identify proteins quickly, with a minimum of cost and effort.

Traditionally, proteins from 2-D gels have been identified by techniques such as
immunoblotting, N-terminal microsequencing, internal peptide sequencing,
comigration of unknown proteins with known proteins, or by overexpression of
homologous genes of interest in the organism under study (Matsudaira, 1987; Rosenfeld
et al., 1992; VanBogelen et al., 1992; Celis et al., 1993; Honoré et al., 1993; Garrels
et al., 1994). Whilst these techniques are powerful identification tools, they are too
expensive or time and labour intensive to use in mass screening programs. A
hierarchical approach to mass protein identification has been recently suggested as an
Table 5: Hierarchical analysis for mass screening of 2-D separated proteins blotted to membranes.
Rapid and inexpensive techniques are used as a first step in protein identification, and slower, more
expensive techniques are then used if necessary. Table modified from Wasinger et al. (1995).

Order  Identification technique                 References

1      Amino acid analysis                      Jungblut et al., 1992; Shaw, 1993;
                                                Hobohm, Houthaeve and Sander, 1994;
                                                Jungblut et al., 1994; Wilkins et al., 1995
2      Amino acid analysis with N-terminal      Wilkins et al., submitted
       sequence tag
3      Peptide-mass fingerprinting              Henzel et al., 1993; Pappin, Hojrup and
                                                Bleasby, 1993; James et al., 1993;
                                                Mann, Hojrup and Roepstorff, 1993;
                                                Yates et al., 1993; Mortz et al., 1994;
                                                Sutton et al., 1995
4      Combination of amino acid analysis       Cordwell et al., 1995;
       and peptide mass fingerprinting          Wasinger et al., 1995
5      Mass spectrometry sequence tag           Mann and Wilm, 1994
6      Extensive N-terminal Edman               Matsudaira, 1987
       microsequencing
7      Internal peptide Edman                   Rosenfeld et al., 1992;
       microsequencing                          Hellman et al., 1995
8      Microsequencing by mass spectrometry     Johnson and Walsh, 1992
       (electrospray ionisation, post-source
       decay MALDI-TOF)
9      Ladder sequencing                        Bartlet-Jones et al., 1994



alternative to traditional approaches (Table 5; Wasinger et al., 1995). This involves the
use of rapid and cheap identification tools such as amino acid analysis and peptide
mass fingerprinting as first steps in protein identification, followed by the use of
slower, more expensive and time consuming identification procedures if necessary. In
the construction of this hierarchy the analysis time, cost per sample and the complexity
of the data created have been considered, as whilst some techniques require little
machine time per sample, the analysis of data can be quite involved and time
consuming. Amino acid analysis and peptide mass fingerprinting based identification
techniques in the hierarchy are discussed in detail below. For review of other protein
identification techniques in Table 5, see Patterson (1994) and Mann (1995).



PROTEIN IDENTIFICATION BY AMINO ACID COMPOSITION 

There has been a revival of interest in the use of amino acid composition for
identification of proteins from 2-D gels after early work by Eckerskorn et al. (1988).
This technique uses a protein's idiosyncratic amino acid composition profile in order
to identify it by comparison with theoretical compositions of proteins in databases.
The amino acid composition of proteins can be determined by differential metabolic
radiolabelling and quantitative autoradiography after 2-D electrophoresis (Garrels et
al., 1994; Frey et al., 1994), or by acid hydrolysis of membrane-blotted proteins and
chromatographic analysis of the resulting amino acid mixture (Eckerskorn et al.,
1988; Tous et al., 1989; Gharahdaghi et al., 1992; Jungblut et al., 1992; Wilkins et al.,
1995). As differential metabolic labelling experiments require X-ray film or
phosphor-image plate exposures of up to 140 days, and can only be undertaken with easily
radiolabelled samples, the technique is not as rapid or widely applicable as
chromatography-based analysis. Proteins blotted to PVDF membranes can be hydrolysed in 1 h
at 155°C, amino acids extracted in a single brief step, and each sample automatically
derivatised and separated by chromatography in under 40 minutes (Wilkins et al.,
1995; Ou et al., 1995). In this manner, one operator can routinely analyse 100 proteins
per week on one HPLC unit. This technology lends itself to automation, and it is
anticipated that instruments with even greater sample throughput will be developed.
When proteins have been prepared by micropreparative 2-D electrophoresis (Hanash
et al., 1991; Bjellqvist et al., 1993b), blotted to a PVDF membrane and stained with
amido black, any visible protein spot is of sufficient quantity for amino acid analysis
(Cordwell et al., 1995; Wasinger et al., 1995; Wilkins et al., 1995).

Figure 4. Computer printout from ExPASy server where the empirical amino acid composition,
estimated pI and MW of a protein from a 2-D reference map of E. coli were matched against all entries in
SWISS-PROT for E. coli. The correct identification, aspartate carbamoyltransferase, is shown in bold. Low
scores indicate a good match. Note how matching within a defined pI and MW range (lower set of proteins)
has greatly increased the score difference between the first and second ranking proteins. This score
difference gives high confidence in the identification, and is only observed where the top ranking protein
is the correct identification (Wilkins et al., 1995).

After the amino acid composition of a protein has been determined, computer
programs are used to match it against the calculated compositions of proteins in
databases (Eckerskorn et al., 1988; Sibbald, Sommerfeldt and Argos, 1991; Jungblut
et al., 1992; Shaw, 1993; Hobohm, Houthaeve and Sander, 1994; Wilkins et al.,
1995). Matching is usually done with only 15 or 16 amino acids, as cysteine and
tryptophan are destroyed during hydrolysis, asparagine and glutamine are deamidated
to their corresponding acids, and proline is not quantitated in some analysis systems.
The computer programs produce a list of best matching proteins, which are ranked by
a score that indicates the match quality. Some programs allow matching to be
restricted to specific 'windows' of MW and pI (Hobohm, Houthaeve and Sander,
1994; Wilkins et al., 1995), and to protein database entries for one species (Jungblut
et al., 1992; Wilkins et al., 1995). The use of such restrictions increases the power of
matching. An example of protein identification by amino acid composition is shown
in Figure 4. To date, amino acid composition has been used to identify proteins from
reference maps of Spiroplasma melliferum, Mycoplasma genitalium, E. coli,
Saccharomyces cerevisiae, Dictyostelium discoideum, human sera, human heart, human
lymphocyte, and mouse brain (Cordwell et al., 1995; Wasinger et al., 1995; Wilkins
et al., 1995; Jungblut et al., 1992, 1994; Garrels et al., 1994; Frey et al., 1994).

Figure 5. A PVDF protein spot from an E. coli 2-D reference map was sequenced for 4 cycles, and the
same sample then subject to amino acid analysis. The N-terminal sequence was MLKR. When the amino
acid composition of the spot, as well as estimated pI and MW, were matched against all entries in
SWISS-PROT for E. coli, the above list of best matches was produced. N-terminal sequences are from SWISS-PROT
for those entries. The top ranking identification of serine hydroxymethyltransferase (bold) did not show a
large score difference between the first and second ranking proteins, giving little confidence in this being
the correct protein identification. However, the sequence tag (MLKR) confirmed the identity of the
protein as serine hydroxymethyltransferase.
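
The matching step can be sketched as follows. The score used here, a sum of squared
differences in mole percent over 16 measurable amino acids, is an assumption chosen
for simplicity and is not necessarily the score computed by the programs cited above;
as in Figure 4, a lower score indicates a better match. The window sizes and the
layout of the database dictionary are also hypothetical.

```python
# Asx = Asp + Asn and Glx = Glu + Gln, since the amides are deamidated during
# hydrolysis; Cys and Trp are destroyed and therefore ignored.
MEASURED = ["Asx", "Glx", "S", "H", "G", "T", "A", "P",
            "Y", "R", "V", "M", "I", "L", "F", "K"]


def composition(sequence):
    """Theoretical composition of a sequence in mole percent."""
    counts = {aa: 0 for aa in MEASURED}
    for residue in sequence:
        if residue in "DN":
            counts["Asx"] += 1
        elif residue in "EQ":
            counts["Glx"] += 1
        elif residue in counts:
            counts[residue] += 1
    total = sum(counts.values())
    return {aa: 100.0 * n / total for aa, n in counts.items()}


def score(measured, theoretical):
    """Sum of squared differences; lower means a closer match."""
    return sum((measured[aa] - theoretical[aa]) ** 2 for aa in MEASURED)


def rank_matches(measured, database, mw=None, pi=None,
                 mw_window=0.2, pi_window=0.5):
    """Rank database entries against a measured composition.

    database: dict mapping a name to (sequence, pI, MW).
    If mw or pi is given, only entries inside the window are scored.
    Returns a list of (score, name) sorted from best to worst.
    """
    hits = []
    for name, (sequence, db_pi, db_mw) in database.items():
        if mw is not None and abs(db_mw - mw) > mw_window * mw:
            continue
        if pi is not None and abs(db_pi - pi) > pi_window:
            continue
        hits.append((score(measured, composition(sequence)), name))
    return sorted(hits)
```

Restricting the search to a pI and MW window removes most competing entries before
scoring, which is one way to picture why such restrictions increase the power of
matching.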

PROTEIN IDENTIFICATION BY AMINO ACID COMPOSITION AND N-TERMINAL 
SEQUENCE TAG 

When samples from 2-D gels are not unambiguously identified by amino acid
composition, pI and MW, often the correct identification of that protein is amongst the
top rankings of the list (Hobohm, Houthaeve and Sander, 1994; Cordwell et al., 1995;
Wilkins et al., 1995). Taking advantage of this observation, we have used the mass
spectrometry 'sequence tag' concept (Mann and Wilm, 1994) in developing a
combined Edman degradation and amino acid analysis approach to protein identification
(Wilkins et al., submitted). This involves the N-terminal sequencing of PVDF-blotted
proteins by Edman degradation for 3 or 4 cycles to create a 'sequence tag', following
which the same sample is used for amino acid analysis. As only a few amino acids are
removed from the protein, its composition is not significantly altered. Furthermore,
since only a small amount of protein sequence is required, fast but low repetitive yield
Edman degradation cycles can be used. Modifications to current procedures should
allow 3 cycles to be completed in 1 h, thereby allowing the screening of 100 or more
proteins per week on one automated, multi-cartridge sequenator. Amino acid
composition, pI and MW of proteins are matched against databases as described above, and
N-terminal sequences of best matching proteins are checked with the 'sequence tag'
to confirm the protein identity (Figure 5). This technique will be less useful when
proteins are N-terminally blocked, but as only a few N-terminal amino acids are
susceptible to the acetyl, formyl, or pyroglutamyl modifications that cause blockade,
this may itself provide useful information for sequence tag identification. A strength
of N-terminal sequence tag and amino acid composition protein identification is that
data generated are quickly and easily interpreted.
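
The confirmation step is simple enough to sketch directly. The function below walks
down a ranked list of candidate names and returns the first whose database N-terminal
sequence begins with the tag. The function name and data layout are hypothetical, and
the sketch deliberately ignores complications such as N-terminal processing of the
mature protein.

```python
def confirm_with_tag(ranked_names, n_terminal_sequences, tag):
    """Return the best-ranked candidate whose N-terminus matches the tag.

    ranked_names: candidate names ordered from best to worst composition match.
    n_terminal_sequences: dict mapping a name to its database sequence.
    tag: the 3 or 4 residues obtained by Edman degradation.
    Returns None if no candidate begins with the tag.
    """
    for name in ranked_names:
        if n_terminal_sequences.get(name, "").startswith(tag):
            return name
    return None
```

With invented candidate sequences such as "MSEQ...", "MLKR..." and "MKLT...", a tag of
MLKR would select the second candidate even if it were not ranked first by
composition alone.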
PROTEIN IDENTIFICATION BY PEPTIDE MASS FINGERPRINTING

Techniques for the identification of proteins by peptide mass fingerprinting have
recently been described (Henzel et al., 1993; Pappin, Hojrup and Bleasby, 1993;
James et al., 1993; Mann, Hojrup and Roepstorff, 1993; Yates et al., 1993; Mortz et
al., 1994; Sutton et al., 1995). This involves the generation of peptides from proteins
using residue-specific enzymes, the determination of peptide masses, and the
matching of these masses against theoretical peptide libraries generated from protein
sequence databases. As proteins have different amino acid sequences, their peptides
should produce characteristic 'fingerprints'.

The first step of peptide mass fingerprinting is protein digestion. Proteins within the
gel matrix or bound to PVDF can be enzymatically digested in situ, although in-gel
digests are reported to produce more enzyme autodigestion products, which
complicate subsequent peptide mass analysis (James et al., 1993; Rasmussen et al., 1994;
Mortz et al., 1994). The enzyme of choice for digestion is currently trypsin (of
modified sequencing grade), but other enzymes (Lys-C or S. aureus V8 protease) have
also been used (Pappin, Hojrup and Bleasby, 1993). To maximise the number of
peptides obtained, it is desirable for protein samples to be reduced and alkylated prior
to digestion (Mortz et al., 1994; Henzel et al., 1993). This ensures that all disulfide
bonds of the protein are broken, and produces protein conformations that are more
amenable to digestion. Surprisingly, chemical digestion methods such as cyanogen
bromide (methionine specific), formic acid (aspartic acid specific) and
2-(2'-nitrophenylsulfenyl)-3-methyl-3-bromoindolenine (tryptophan specific) have not
been explored as means of peptide production for mass fingerprinting, even though
they are rapid and may circumvent some problems associated with enzyme digestions
(Nikodem and Fresco, 1979; Crimmins et al., 1990; Vanfleteren et al., 1992).
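
A theoretical tryptic digest of a database sequence can be sketched as below. The
cleavage rule used (after lysine or arginine, except when the next residue is proline)
is the commonly quoted rule for trypsin, and the residue masses are standard
monoisotopic values; the choice of monoisotopic rather than average masses, and the
function names, are assumptions of this sketch.

```python
import re

# Monoisotopic residue masses (Da); a peptide adds one water.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(sequence, missed_cleavages=0):
    """Cleave after K or R unless followed by P; optionally allow missed sites."""
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", sequence) if f]
    peptides = []
    for i in range(len(fragments)):
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            # Joining neighbouring fragments models a missed cleavage.
            peptides.append("".join(fragments[i:j + 1]))
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER
```

Applying these two functions to every entry of a sequence database gives the kind of
theoretical peptide library against which measured masses are matched.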

After proteins are digested, peptide masses are determined by mass spectrometry.
Direct analysis of peptide mixtures can be achieved by electrospray ionisation mass
spectrometry, plasma desorption mass spectrometry, or matrix assisted laser desorption
ionization (MALDI) mass spectrometry techniques. MALDI is preferable because of
its higher sensitivity and greater tolerance to contaminating substances from 2-D gels
(James et al., 1993; Mortz et al., 1994; Pappin, Hojrup and Bleasby, 1993).
Furthermore, recent modifications to sample preparation methods have largely solved early
difficulties experienced with the calibration of MALDI spectra (Mortz et al., 1994;
Vorm and Mann, 1994; Vorm, Roepstorff and Mann, 1994). The high sensitivity of
mass spectrometry allows a small fraction of a digest of a 1 µg protein spot to be used
for analysis, and analysis itself is complete in a few minutes.

A major challenge associated with peptide mass fingerprinting is data interpretation
prior to computer matching against libraries of theoretical peptide digests. Spectra
must be examined carefully to determine which peaks represent peptide masses of
interest, as there are often enzyme autodigestion products and contaminating
substances present (Henzel et al., 1993; Mortz et al., 1994; Rasmussen et al., 1994).
Furthermore, if protein alkylation and reduction has not been undertaken prior to
protein digestion, peptide sequence coverage may be poor (40% to 70%), with some
masses present representing disulfide bonded peptides originally present in the protein
(Mortz et al., 1994). For eukaryotes, a serious issue is the alteration of peptide masses
by the presence of post-translational modifications (Table 6). The mass of the
unmodified peptide alone can be very difficult to determine. Two artifactual
modifications introduced by electrophoresis, an acrylamide adduct to cysteine and the
oxidation of methionine, are also known to alter peptide masses (le Maire et al., 1993;
Hess et al., 1993).

Table 6: Masses of some common post-translational modifications. Peptides carrying
post-translational modifications complicate data analysis for peptide mass fingerprinting protein
identification. This is especially so for protein glycosylation, which involves many different
combinations of the hexosamines, hexoses, deoxyhexoses and sialic acid.

Post-translational modification               Mass change

Acetylation                                   +42.04
*Acrylamide adduct to cysteine                +71.00
Carboxylation of Asp or Glu                   +44.01
Deamidation of Asn or Gln                     +0.98
Disulfide bond formation                      -2.02
Deoxyhexoses (Fuc)                            +146.14
Formylation                                   +28.01
Hexosamines (GlcN, GalN)                      +161.16
Hexoses (Glc, Gal, Man)                       +162.14
Hydroxylation                                 +16.00
N-acetylhexosamines (GlcNAc, GalNAc)          +203.19
*Oxidation of Met                             +16.00
Phosphorylation                               +79.98
Pyroglutamic acid formed from Gln             -17.03
Sialic acid (NeuNAc)                          +291.26
Sulfation                                     +80.06

Table modified from Finnigan LASERMAT application data sheet 5.
Asterisk (*) denotes modifications that can arise artifactually from the 2-D electrophoresis process.

A number of computer programs are available for matching peptide masses against
databases (reviewed in Cottrell, 1994). Matching is usually undertaken in an interactive
manner, whereby peaks of mass 500-3000 Da are selected and matched under
various search parameters including MW of protein, mass accuracy of peptides, and
number of missed enzyme cleavages allowed (Henzel et al., 1993; Mortz et al., 1994;
Rasmussen et al., 1994). The correct protein identity is taken to be the protein which has
the most peptide masses in common with the unknown sample. Identities have been
established with as few as three peptides, but unambiguous identification is thought to
require a mass spectrometric map covering most peptides of the protein (Mortz et al., 1994;
Yates et al., 1993). To date, peptide mass fingerprinting of proteins has been
undertaken from the human myocardial protein and keratinocyte maps, from an E. coli
2-D gel, and from reference maps of Spiroplasma melliferum and Mycoplasma
genitalium (Sutton et al., 1995; Rasmussen et al., 1994; Henzel et al., 1993; Cordwell
et al., 1995; Wasinger et al., 1995), although the technique is most powerful when
used in combination with another protein identification technique (Rasmussen et al.,
1994; Cordwell et al., 1995).
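
The matching step described above can be sketched in Python as follows. This is a minimal illustration and not the algorithm of any of the cited programs: it assumes tryptic cleavage after K or R (but not before P), allows no missed cleavages, uses average residue masses, and scores each database entry simply by the number of observed masses it explains. The function names, the tolerance and the two database sequences are hypothetical.

```python
import re

# Average residue masses (Da) of the 20 common amino acids.
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER = 18.02  # one water is added per free peptide


def tryptic_peptides(sequence):
    """Cleave after K or R, but not before P (no missed cleavages).

    Splitting on an empty match requires Python 3.7 or later.
    """
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide):
    """Average mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def count_matches(observed, sequence, tolerance=0.5, mass_range=(500.0, 3000.0)):
    """Count observed masses in the 500-3000 Da window that match a
    theoretical tryptic peptide of `sequence` within `tolerance` Da."""
    theoretical = [peptide_mass(p) for p in tryptic_peptides(sequence)]
    low, high = mass_range
    return sum(
        1
        for mass in observed
        if low <= mass <= high
        and any(abs(mass - t) <= tolerance for t in theoretical)
    )


def rank_candidates(observed, database, **search_parameters):
    """Rank database entries by the number of matching peptide masses."""
    scores = [
        (name, count_matches(observed, sequence, **search_parameters))
        for name, sequence in database.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical two-entry database; the "unknown" is a digest of protA.
database = {
    "protA": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "protB": "GAVLIPFMWKSTCYNQRDEHKGAVLIR",
}
observed = [peptide_mass(p) for p in tryptic_peptides(database["protA"])]
print(rank_candidates(observed, database))
```

Real search programs also consider the protein MW window, missed cleavages and modified peptides; these are omitted here to keep the sketch short.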

MASS SPECTROMETRY SEQUENCE TAGGING 

An extension of peptide mass fingerprinting has recently been described, called
peptide sequence tagging (Mann and Wilm, 1994; Mann, 1995). This uses tandem
mass spectrometry (MS/MS) to first determine the mass of peptides, then subject
them to fragmentation by collision with a gas, and finally determine the mass of the
fragments. The resulting spectra give information about a peptide's amino acid
sequence. The fragment masses can rarely be used to assign a complete
sequence, but they usually allow a short 'sequence tag' of 2 or 3 amino acids to be
determined. This sequence tag and the original peptide mass are matched by computer
against a database, providing a likely identity of the peptide and the protein it came from.
The major drawback of this technique as a mass screening tool is the complexity of the
mass data generated and the high level of expertise required for its interpretation.
Nevertheless, it represents a useful new protein identification method which greatly
increases the power of peptide mass fingerprinting for protein identification.
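
The idea of reading a short tag from fragment masses and combining it with the peptide mass can be illustrated with the Python sketch below. It is a simplification made under stated assumptions: the fragment masses are taken to belong to a single ion series read from the N-terminal side, average residue masses are used, and residues that cannot be distinguished within the tolerance (for example Leu and Ile) are kept as alternatives. The function names, tolerances and example masses are hypothetical and do not reproduce the published method.

```python
import re

# Average residue masses (Da) of the 20 common amino acids.
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER = 18.02


def peptide_mass(peptide):
    """Average mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def sequence_tag_pattern(fragment_masses, tolerance=0.1):
    """Turn differences between successive fragment masses into a regular
    expression; return None if any difference fits no residue."""
    ladder = sorted(fragment_masses)
    parts = []
    for low, high in zip(ladder, ladder[1:]):
        delta = high - low
        candidates = sorted(
            r for r, m in RESIDUE_MASS.items() if abs(m - delta) <= tolerance
        )
        if not candidates:
            return None
        parts.append("[" + "".join(candidates) + "]")
    return "".join(parts)


def match_tag(observed_mass, fragment_masses, peptides, mass_tolerance=0.5):
    """Keep peptides that agree with both the peptide mass and the tag."""
    pattern = sequence_tag_pattern(fragment_masses)
    if pattern is None:
        return []
    return [
        p
        for p in peptides
        if abs(peptide_mass(p) - observed_mass) <= mass_tolerance
        and re.search(pattern, p)
    ]


# Hypothetical fragment ladder spelling Tyr followed by Leu/Ile.
fragments = [300.0, 463.18, 576.34]
candidates = ["TAYIAK", "TAYLAK", "KAIYAT", "QISFVK"]
print(sequence_tag_pattern(fragments))
print(match_tag(665.79, fragments, candidates))
```

In the example, a peptide with the same mass but a different residue order is rejected by the tag, which is the additional discrimination that the tag provides over the peptide mass alone.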

Cross-species protein identification 

Protein sequence databases continue to grow at a rapid rate, yet it is not widely
appreciated that close to 90% of all information contained in current protein databases
comes from only 10 species (A. Bairoch, Pers. Comm.). Fortunately, this information
can be used to study proteomes of organisms that are poorly defined at the molecular
level, via 2-D electrophoresis and 'cross-species' protein identification (Cordwell et
al., 1995; Wasinger et al., 1995). This approach allows proteins from reference maps
of many different species to be identified without the need for the corresponding genes
to be cloned and sequenced. This is particularly true for 'housekeeping' proteins, such
as enzymes involved in glycolysis, DNA manipulation and protein manufacture,
which are highly conserved across species boundaries. Proteins that cannot be
identified across species boundaries can then become the focus of further protein
characterisation and DNA sequencing efforts.

[Figure 6, panels (A) and (B): ranked lists of best-matching database entries for human
apolipoprotein A-I. Listing not reproduced.]

Figure 6. Theoretical cross-species matching of human apolipoprotein A-I by amino acid composition
and tryptic peptides. When an unknown protein is analysed, best ranking proteins from both techniques can
be compared. If the same protein type is observed in both lists, there is high confidence in this being the
identity of the unknown molecule (Cordwell et al., 1995). (A) Output of ExPASy server (Appel, Bairoch
and Hochstrasser, 1994) where the true amino acid composition of apolipoprotein A-I was matched against
all entries in the SWISS-PROT database, without pI or MW windows. Seven of the top 10 matching
proteins were apolipoprotein A-I of different species. (B) Output of MOWSE peptide mass fingerprinting
program (Pappin, Hojrup and Bleasby, 1993) where true tryptic peptides of human apolipoprotein A-I were
matched against the OWL database, using a MW window of 10%. Four of the top ten matching proteins were
apolipoprotein A-I from different species.

Rapid cross-species identification of proteins from 2-D reference maps can be
undertaken with amino acid composition or peptide mass fingerprinting methods
(Figure 6), but these techniques alone may not identify proteins unambiguously when
phylogenetic cross-species distances are great or analysis data is of poor quality (Yates
et al., 1993; Shaw, 1993; Cordwell et al., 1995). However, very high confidence in
protein identities can be achieved when lists of best-matching proteins generated by
both techniques are compared (Cordwell et al., 1995; Wasinger et al., 1995). The
correct identification is found when the same protein is ranked highly in lists of best
matches generated by both techniques. This method has allowed approximately 120
proteins from the reference map of the mollicute Spiroplasma melliferum, representing
approximately one quarter of the proteome, to be confidently identified by
reference to protein information from other species (S. Cordwell, Personal
Communication). When cross-species protein identification is to be undertaken, it should be
noted that the molecular weight of a protein type across species is usually highly
conserved, but that protein pI can vary by more than 2 units (Cordwell et al., 1995).
Accurate molecular weight determination by direct mass spectrometry of proteins
blotted to PVDF (Eckerskorn et al., 1992) should therefore be a useful additional
parameter for cross-species protein identification.
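
The comparison of best-match lists from the two techniques can be expressed as a short Python sketch. It assumes each technique returns a best-first list of (protein type, species) pairs, keeps protein types that appear in the top entries of both lists regardless of species, and orders them by the sum of their best ranks. The summed-rank ordering, the function names and the example rankings are assumptions for illustration and are not real search output.

```python
def best_ranks(ranking, top_n):
    """Map each protein type to its best (lowest) rank within the top_n
    entries of a best-first list of (protein_type, species) pairs."""
    ranks = {}
    for position, (protein_type, _species) in enumerate(ranking[:top_n], start=1):
        ranks.setdefault(protein_type, position)
    return ranks


def consensus_identities(composition_ranking, fingerprint_ranking, top_n=10):
    """Protein types ranked highly by both techniques, best first."""
    by_composition = best_ranks(composition_ranking, top_n)
    by_fingerprint = best_ranks(fingerprint_ranking, top_n)
    shared = by_composition.keys() & by_fingerprint.keys()
    return sorted(
        shared,
        key=lambda t: (by_composition[t] + by_fingerprint[t], t),
    )


# Invented rankings for illustration only.
composition = [
    ("apolipoprotein A-I", "Homo sapiens"),
    ("apolipoprotein A-I", "Macaca fascicularis"),
    ("tropomyosin", "African clawed frog"),
    ("chloroplast heat shock protein", "Pisum sativum"),
]
fingerprint = [
    ("TraJ protein", "Escherichia coli"),
    ("apolipoprotein A-I", "Homo sapiens"),
    ("tropomyosin", "African clawed frog"),
]
print(consensus_identities(composition, fingerprint))
print(consensus_identities(composition, fingerprint, top_n=2))
```

Comparing by protein type rather than by database entry reflects the cross-species setting, where the best matches are expected to be the same protein from other organisms.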

CHARACTERISATION OF POST-TRANSLATIONAL MODIFICATIONS 

Many proteins are modified after translation. Such post-translational modifications,
including glycosylation, phosphorylation, and sulfation (see Table 6), are usually
necessary for protein function or stability. Some abnormal modifications are associated
with disease (Duthel and Revol, 1993; Ghosh et al., 1993; Yamashita et al.,
1993). In proteome studies, post-translational modifications can be examined on all
proteins present, or on individual spots. Studies on all proteins provide an indication
of which proteins may carry a certain type of modification. For example, 2-D gel
analysis of cell cultures grown in the presence of [3H]mannose or [32P]phosphate
gives an indication of which proteins carry glycans containing mannose, and which
proteins are phosphorylated (Garrels and Franza, 1989). Lectin binding studies of 2-D
gels blotted to PVDF or nitrocellulose provide information on the saccharides, if any,
that are carried by proteins present (Gravel et al., 1994).

When individual proteins of interest carrying post-translational modifications have
been found, micropreparative 2-D electrophoresis can be used to purify them in
microgram quantities (Hanash et al., 1991; Bjellqvist et al., 1993b). If protein
isoforms of similar MW and pI are to be studied, focusing with narrow range pI
gradients (1 pH unit) can provide greater separation and resolution. After
electrophoresis, the type and degree of protein phosphorylation can be investigated (Murthy
and Iqbal, 1991; Gold et al., 1994), monosaccharide composition can be determined
(Weitzhandler et al., 1993; Packer et al., 1995), and the structure and exact site of
glycoamino acids can be investigated by either Edman degradation based techniques
or by mass spectrometry (Pisano et al., 1993; Huberty et al., 1993; Carr, Huddleston
and Bean, 1993). With further development of rapid techniques, investigation of
phosphorylation and monosaccharides by chromatographic or mass spectrometric
means is likely to become a routine step in the characterisation of post-translational
modifications of proteins from reference maps.

The status of proteome projects

Many technical aspects of proteome research have already been discussed in this
review, but an overview of the status of proteome projects has not yet been presented.
Advances in proteome projects will initially rely on progress in genome sequencing
initiatives, to enable an identity, amino acid sequence, or function to be assigned to
each protein spot. Table 7 shows genome size, proteome size, and the number of
proteins already defined for a number of model organisms. This indicates that whilst
genome sequencing programs for E. coli and S. cerevisiae are advanced, the massive
size of some other genomes (and especially the human genome) means that their
complete nucleotide sequences are unlikely to be available for many years. Because of
this, 2-D reference maps and proteome projects of single cell organisms like
Mycoplasma sp., E. coli and S. cerevisiae will be the most detailed (Cordwell et al., 1995;
Wasinger et al., 1995; VanBogelen et al., 1992; Garrels et al., 1994), and complete
maps of other organisms will take longer to construct. However, the use of cross-species
protein identification techniques will allow proteomes of many prokaryotes
and simple eukaryotes to be partially defined in reference to E. coli and S. cerevisiae.

Table 7: Estimated genome size, estimated proteome size, number of protein sequences in
SWISS-PROT Release 31 (March, 1995), and approximate number of proteins of known identity on 2-D
reference maps for some model organisms. Genome size data from Smith (1994), and total protein data
from Bird (1995). Genome sequencing projects of E. coli and S. cerevisiae will probably be complete in
1996.

Mycoplasma species
> 100
Escherichia coli
4.8
> 300
13.5
> MK)
70
12500
70
Caenorhabditis elegans
2900
60000-80000

The study of vertebrate proteomes and vertebrate development is a phenomenal
undertaking in comparison to the investigation of single cell organisms. This is
because vast numbers of proteins are developmentally expressed, each body tissue has
hundreds of unique proteins, and there are numerous tissue types. However, it is
estimated that at least 35% of proteins in vertebrate cells will be conserved from tissue
to tissue, constituting the 'housekeeping' proteins (Bird, 1995), with the remainder of
proteins constituting a set that are specific to a cell type. Provided that standardised
electrophoretic conditions are used, reference maps from many tissues of one organism
can be superimposed in gel databases (e.g. Hochstrasser et al., 1992). This
accelerates the definition of the 'housekeeping' proteins, as well as sets of proteins that
are unique to different tissue types. Such studies may, however, be complicated by
post-translational modifications, which can differ on the same gene product in
different tissues. Proteins that remain unknown after identification procedures will be
useful in providing focus for nucleic acid sequencing initiatives.
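
Once reference maps from several tissues have been superimposed and their spots matched, separating 'housekeeping' proteins from tissue-specific ones amounts to simple set operations. The Python sketch below assumes that matched spots share a common identifier across maps; the function name, tissue names and spot numbers are hypothetical.

```python
def partition_spots(tissue_maps):
    """Split matched spots into those present in every tissue map and
    those found in only one.

    `tissue_maps` maps a tissue name to the set of spot identifiers on
    its reference map; at least one tissue is expected.
    """
    spot_sets = {tissue: set(spots) for tissue, spots in tissue_maps.items()}
    housekeeping = set.intersection(*spot_sets.values())
    unique = {}
    for tissue, spots in spot_sets.items():
        elsewhere = set().union(
            *(other for name, other in spot_sets.items() if name != tissue)
        )
        unique[tissue] = spots - elsewhere
    return housekeeping, unique


# Hypothetical spot numbers on three superimposed maps.
maps = {
    "liver": {"1102", "2305", "4410"},
    "heart": {"1102", "2305", "5520"},
    "plasma": {"1102", "6003"},
}
housekeeping, unique = partition_spots(maps)
print(sorted(housekeeping))
print({tissue: sorted(spots) for tissue, spots in unique.items()})
```

The sketch ignores the complication noted above, namely that differing post-translational modifications can move the same gene product to different positions in different tissues.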

FUTURE DIRECTIONS OF PROTEOME PROJECTS

This review has described recent advances in the area of proteome research. It has
illustrated how new developments of older techniques (2-D electrophoresis and amino
acid analysis) as well as the applications of new technology (mass spectrometry) have
greatly widened the choice of tools the biologist and protein chemist has for the
separation, identification and analysis of complex mixtures of proteins. This has made
possible the establishment of detailed reference maps for organisms, which are
becoming the method of choice for the definition of tissues or whole cells, and the
investigation of gene expression therein.

Proteome projects are already impacting on the dogma of molecular biology that
DNA sequence constitutes the definition of an organism. For example, the proteomes
of different tissues of a single organism are often significantly different. Similarly,
cross-species identification of proteins (for example the identification of proteins
from Candida albicans by comparison with S. cerevisiae) can open up studies on
organisms that are poorly molecularly defined. As cross-species identification can
proceed at a pace orders of magnitude faster than a genome project in terms of
defining the gene and protein complement of organisms, the need for the DNA
sequencing of genomes will be avoided, and emphasis placed on those proteins found
to be novel.

Just as genome sequencing is not an end in itself, neither is an annotated 2-D protein 
reference map of an organism, nor indeed the identification of proteins in a proteome. 
So whilst an immediate aim of proteome projects is to screen proteins in reference 
maps, this will lead to expression studies and characterisation of post-translational 
modifications. The challenge that then needs to be addressed is the investigation of 
structure and function of proteins in a proteome. The magnitude of this is illustrated by 
the fact that over half the open reading frames identified in S. cerevisiae chromosome
III were initially of no known function (Oliver et al., 1992). Structural and functional
studies will be an undertaking just as formidable as genome studies are now and 
proteome projects are becoming, but will lead to an unimaginably detailed under- 
standing of how living organisms are constructed and how they operate. 